CDH 5.3 Hadoop cluster using VirtualBox and QuickStart VM II: Testing
Continued from CDH 5.3 Hadoop cluster using VirtualBox and QuickStart VM, in this chapter we'll test the QuickStart VM with a simple wordcount example. Let's start by creating an input file:
[cloudera@quickstart ~]$ pwd
/home/cloudera
[cloudera@quickstart ~]$ mkdir temp
[cloudera@quickstart ~]$ ls
cloudera-manager  Desktop    eclipse  Pictures  Templates
cm_api.sh         Documents  lib      Public    Videos
datasets          Downloads  Music    temp      workspace
[cloudera@quickstart ~]$ cd temp
[cloudera@quickstart temp]$ ls
[cloudera@quickstart temp]$ echo "If you torture the data long enough, it will confess." > wordcount.txt
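As a quick sanity check (not part of the original session), we can print the file back and count its words; with the sentence above, wc should report 10 words:

# Verify the input file we just created
[cloudera@quickstart temp]$ cat wordcount.txt
If you torture the data long enough, it will confess.
[cloudera@quickstart temp]$ wc -w wordcount.txt
10 wordcount.txt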
Next, we need to find the VM's IP address so we can ssh into it. Issue the "ifconfig" command from the VM:
[cloudera@quickstart ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:7B:18:B1
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:302205 errors:0 dropped:0 overruns:0 frame:0
          TX packets:172416 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:374769173 (357.4 MiB)  TX bytes:21593675 (20.5 MiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:4E:45:86
          inet addr:192.168.56.101  Bcast:192.168.56.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5444 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1100 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:880947 (860.2 KiB)  TX bytes:207516 (202.6 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:35896200 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35896200 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19029196779 (17.7 GiB)  TX bytes:19029196779 (17.7 GiB)
Or "ip addr":
[cloudera@quickstart ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:7b:18:b1 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:4e:45:86 brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.101/24 brd 192.168.56.255 scope global eth1
We see "inet addr:192.168.56.101" in "eth1" part, and that's the ip to which we can ssh from Mac Terminal:
ip-192-168-1-48:.ssh kihyuckhong$ ssh cloudera@192.168.56.101
The authenticity of host '192.168.56.101 (192.168.56.101)' can't be established.
RSA key fingerprint is 86:23:13:67:60:55:b8:d2:11:89:c8:a2:e4:db:4c:b0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.56.101' (RSA) to the list of known hosts.
Connection closed by 192.168.56.101
ip-192-168-1-48:.ssh kihyuckhong$ ssh cloudera@192.168.56.101
cloudera@192.168.56.101's password:
[cloudera@quickstart ~]$
Now we're able to ssh into the CentOS guest where the Cloudera VM is installed!
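If you expect to ssh in often, key-based login saves typing the password each time. A minimal sketch, assuming the default key file names (if ssh-copy-id isn't available on the Mac, append ~/.ssh/id_rsa.pub to the VM's ~/.ssh/authorized_keys by hand):

# On the Mac: generate a key pair if you don't already have one
ip-192-168-1-48:~ kihyuckhong$ ssh-keygen -t rsa

# Copy the public key to the VM; subsequent logins should skip the password prompt
ip-192-168-1-48:~ kihyuckhong$ ssh-copy-id cloudera@192.168.56.101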
[cloudera@quickstart ~]$ hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
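To confirm which Hadoop build the VM ships, the "version" subcommand listed above is handy; on this CDH 5.3.0 image it should report the 2.5.0-cdh5.3.0 build, matching the jar names we'll see below:

# Print the Hadoop build information (first line shown; more details follow)
[cloudera@quickstart ~]$ hadoop version
Hadoop 2.5.0-cdh5.3.0
...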
We can quickly check what we have in HDFS:
[cloudera@quickstart ~]$ hadoop fs -ls /user
Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2015-03-24 20:27 /user/cloudera
drwxr-xr-x   - hdfs     supergroup          0 2015-03-14 20:11 /user/hdfs
drwxr-xr-x   - mapred   hadoop              0 2015-03-15 14:08 /user/history
drwxrwxrwx   - hive     hive                0 2014-12-18 04:33 /user/hive
drwxrwxr-x   - hue      hue                 0 2015-03-21 15:34 /user/hue
drwxrwxrwx   - oozie    oozie               0 2014-12-18 04:34 /user/oozie
drwxr-xr-x   - sample   sample              0 2015-03-14 22:05 /user/sample
drwxr-xr-x   - spark    spark               0 2014-12-18 04:34 /user/spark
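Note that with no path argument, -ls lists the current user's HDFS home directory, /user/cloudera in our case:

# Equivalent to "hadoop fs -ls /user/cloudera" when logged in as cloudera
[cloudera@quickstart ~]$ hadoop fs -ls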
Now let's put our input file into HDFS:
[cloudera@quickstart temp]$ pwd
/home/cloudera/temp
[cloudera@quickstart temp]$ ls
wordcount.txt
[cloudera@quickstart temp]$ hdfs dfs -mkdir /user/cloudera/input
[cloudera@quickstart temp]$ hdfs dfs -ls /user/cloudera/input
[cloudera@quickstart temp]$
[cloudera@quickstart temp]$ hdfs dfs -put /home/cloudera/temp/wordcount.txt /user/cloudera/input
[cloudera@quickstart temp]$ hdfs dfs -ls /user/cloudera/input
Found 1 items
-rw-r--r--   1 cloudera cloudera         54 2015-03-15 17:24 /user/cloudera/input/wordcount.txt
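To verify the upload round-tripped intact, we can cat the file straight out of HDFS; it should match the local original (54 bytes, as the listing shows):

# Print the file from HDFS; the content should match the local copy
[cloudera@quickstart temp]$ hdfs dfs -cat /user/cloudera/input/wordcount.txt
If you torture the data long enough, it will confess.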
We can also check the input file from the web UI.
Let's check the /usr/lib/hadoop-mapreduce/ directory:
[cloudera@quickstart temp]$ ls -ltr /usr/lib/hadoop-mapreduce/
...
lrwxrwxrwx 1 root root      44 Dec 18 04:25 hadoop-mapreduce-examples.jar -> hadoop-mapreduce-examples-2.5.0-cdh5.3.0.jar
Run the jar without arguments to see which MapReduce example programs are bundled in it:
[cloudera@quickstart temp]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
We can locate the program we need, the fourth one from the bottom:
wordcount: A map/reduce program that counts the words in the input files.
[cloudera@quickstart temp]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/cloudera/input/wordcount.txt /user/cloudera/output
15/03/15 17:49:01 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
15/03/15 17:49:02 INFO input.FileInputFormat: Total input paths to process : 1
15/03/15 17:49:02 INFO mapreduce.JobSubmitter: number of splits:1
15/03/15 17:49:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1426453727985_0001
15/03/15 17:49:03 INFO impl.YarnClientImpl: Submitted application application_1426453727985_0001
15/03/15 17:49:03 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1426453727985_0001/
15/03/15 17:49:03 INFO mapreduce.Job: Running job: job_1426453727985_0001
15/03/15 17:49:21 INFO mapreduce.Job: Job job_1426453727985_0001 running in uber mode : false
15/03/15 17:49:21 INFO mapreduce.Job:  map 0% reduce 0%
15/03/15 17:49:37 INFO mapreduce.Job:  map 100% reduce 0%
15/03/15 17:49:48 INFO mapreduce.Job:  map 100% reduce 100%
15/03/15 17:49:48 INFO mapreduce.Job: Job job_1426453727985_0001 completed successfully
15/03/15 17:49:48 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=128
		FILE: Number of bytes written=217765
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=184
		HDFS: Number of bytes written=74
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=13334
		Total time spent by all reduces in occupied slots (ms)=5741
		Total time spent by all map tasks (ms)=13334
		Total time spent by all reduce tasks (ms)=5741
		Total vcore-seconds taken by all map tasks=13334
		Total vcore-seconds taken by all reduce tasks=5741
		Total megabyte-seconds taken by all map tasks=13654016
		Total megabyte-seconds taken by all reduce tasks=5878784
	Map-Reduce Framework
		Map input records=1
		Map output records=10
		Map output bytes=94
		Map output materialized bytes=124
		Input split bytes=130
		Combine input records=10
		Combine output records=10
		Reduce input groups=10
		Reduce shuffle bytes=124
		Reduce input records=10
		Reduce output records=10
		Spilled Records=20
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=737
		CPU time spent (ms)=730
		Physical memory (bytes) snapshot=389378048
		Virtual memory (bytes) snapshot=1715568640
		Total committed heap usage (bytes)=303366144
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=54
	File Output Format Counters
		Bytes Written=74
[cloudera@quickstart temp]$
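Besides the tracking URL in the log, we can query YARN from the command line; a sketch using the application id printed above:

# List applications known to the ResourceManager (including finished ones)
[cloudera@quickstart temp]$ yarn application -list -appStates ALL

# Or ask for the status and progress of this particular run
[cloudera@quickstart temp]$ yarn application -status application_1426453727985_0001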
Here are the output files:
[cloudera@quickstart temp]$ hdfs dfs -ls /user/cloudera/output
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2015-03-15 17:49 /user/cloudera/output/_SUCCESS
-rw-r--r--   1 cloudera cloudera         74 2015-03-15 17:49 /user/cloudera/output/part-r-00000
We can also check the output directory using the web UI.
Now, let's see what's in the output file, part-r-00000:
[cloudera@quickstart temp]$ hdfs dfs -cat /user/cloudera/output/part-r-00000
If	1
confess.	1
data	1
enough,	1
it	1
long	1
the	1
torture	1
will	1
you	1
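One caveat before experimenting further: MapReduce refuses to start if the output directory already exists, so remove it (or choose a new path) before re-running the job. We can also merge the output back to a local file first:

# Optionally merge the result down to a local file
[cloudera@quickstart temp]$ hdfs dfs -getmerge /user/cloudera/output wordcount_result.txt

# Then delete the output directory so the job can be run again
[cloudera@quickstart temp]$ hdfs dfs -rm -r /user/cloudera/output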